linear transformation
Forster Decomposition and Learning Halfspaces with Noise
A Forster transform is an operation that turns a distribution into one with good anticoncentration properties. While a Forster transform does not always exist, we show that any distribution can be efficiently decomposed as a disjoint mixture of few distributions for which a Forster transform exists and can be computed efficiently. As the main application of this result, we obtain the first polynomial-time algorithm for distribution-independent PAC learning of halfspaces in the Massart noise model with strongly polynomial sample complexity, i.e., independent of the bit complexity of the examples. Previous algorithms for this learning problem incurred sample complexity scaling polynomially with the bit complexity, even though such a dependence is not information-theoretically necessary.
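To make the notion concrete, the sketch below checks the property that a Forster transform is commonly defined to achieve. The definition is an assumption here, since the abstract does not spell it out: a matrix A such that the rescaled points Ax/||Ax|| have second-moment matrix I/d (radial isotropic position). The function names are hypothetical, and the code only verifies a candidate A on a finite point set; it does not compute the transform or the decomposition described in the abstract.

```python
import math


def apply_matrix(a, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def normalized_second_moment(points, a):
    """Average of u u^T over the points, where u = Ax / ||Ax||.

    Assumes A is d x d, every point has dimension d, and Ax is never zero.
    """
    d = len(a)
    moment = [[0.0] * d for _ in range(d)]
    for x in points:
        y = apply_matrix(a, x)
        norm = math.sqrt(sum(v * v for v in y))
        u = [v / norm for v in y]
        for i in range(d):
            for j in range(d):
                moment[i][j] += u[i] * u[j] / len(points)
    return moment


def is_radially_isotropic(points, a, tol=1e-9):
    """True if the normalized second moment equals I/d up to tol (assumed definition)."""
    d = len(a)
    m = normalized_second_moment(points, a)
    return all(
        abs(m[i][j] - (1.0 / d if i == j else 0.0)) <= tol
        for i in range(d)
        for j in range(d)
    )


identity = [[1.0, 0.0], [0.0, 1.0]]
shear = [[1.0, -1.0], [0.0, 1.0]]
points = [[1.0, 0.0], [1.0, 1.0]]
print(is_radially_isotropic(points, identity))  # the raw points are skewed
print(is_radially_isotropic(points, shear))     # the shear maps them to orthogonal directions
```

In this toy example the two points are not isotropic as given, but the shear sends them to the two coordinate axes, so the normalized second moment becomes I/2.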
Shapeshifter: a Parameter-efficient Transformer using Factorized Reshaped Matrices
Language models employ a very large number of trainable parameters. Despite being highly overparameterized, these networks often achieve good out-of-sample test performance on the original task and easily fine-tune to related tasks. Recent observations involving, for example, the intrinsic dimension of the objective landscape and the lottery ticket hypothesis indicate that training often actively involves only a small fraction of the parameter space. This raises the question of how large a parameter space needs to be in the first place: evidence from recent work on model compression, parameter sharing, factorized representations, and knowledge distillation increasingly shows that models can be made much smaller and still perform well. Here, we focus on factorized representations of the matrices that underpin dense, embedding, and self-attention layers. We use a low-rank factorized representation of a reshaped and rearranged original matrix to achieve space-efficient and expressive linear layers. We prove that stacking such low-rank layers increases their expressiveness, providing theoretical understanding for their effectiveness in deep networks. In Transformer models, our approach leads to a more than tenfold reduction in the total number of trainable parameters, including embedding, attention, and feed-forward layers, with little degradation in on-task performance. The approach operates out-of-the-box, replacing each parameter matrix with its compact equivalent while maintaining the architecture of the network.
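One way to read "low-rank factorization of a reshaped and rearranged matrix" is as a sum of Kronecker products: a rank-r factorization of the rearranged (m1*n1) x (m2*n2) matrix corresponds to writing the original (m1*m2) x (n1*n2) matrix as a sum of r terms A_k ⊗ B_k. The sketch below illustrates that reading and the resulting parameter count. It is an assumption about the construction, not the paper's code, and the function names and the sizes in the usage lines are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def sum_of_kroneckers(factors):
    """Rebuild the full matrix from a list of (A_k, B_k) factor pairs."""
    total = None
    for a, b in factors:
        term = kron(a, b)
        if total is None:
            total = term
        else:
            total = [
                [x + y for x, y in zip(row_t, row_n)]
                for row_t, row_n in zip(total, term)
            ]
    return total


def param_counts(m1, n1, m2, n2, rank):
    """Return (dense, factored) parameter counts for an (m1*m2) x (n1*n2) matrix."""
    dense = (m1 * m2) * (n1 * n2)
    factored = rank * (m1 * n1 + m2 * n2)
    return dense, factored


# Tiny reconstruction: two rank-1 terms build a 2 x 2 matrix.
full = sum_of_kroneckers([
    ([[1, 2]], [[1], [1]]),
    ([[1, 0]], [[0], [3]]),
])
print(full)

# Hypothetical sizes: a 768 x 768 layer split as 32*24 by 32*24, rank 4.
print(param_counts(32, 32, 24, 24, rank=4))
```

The full matrix is never stored during training in such a scheme; only the small factors are. The reconstruction above is shown purely to make the correspondence visible.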
RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions
We propose a method for metric-scale monocular depth estimation. Inferring depth from a single image is an ill-posed problem due to the loss of scale from perspective projection during the image formation process. Any scale chosen is a bias, typically stemming from training on a dataset; hence, existing works have instead opted to use relative (normalized, inverse) depth. Our goal is to recover metric-scaled depth maps through a linear transformation. The crux of our method lies in the observation that certain objects (e.g., cars, trees, street signs) are typically found or associated with certain types of scenes (e.g., outdoor).
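As a minimal illustration of recovering metric depth "through a linear transformation", the sketch below applies a global scale and shift to a relative inverse-depth map and then inverts the result. The choice to apply the transformation in inverse-depth space, the clamping constant, and the function name are assumptions made for illustration; the abstract does not specify how the scale and shift are obtained or applied.

```python
def to_metric_depth(relative_inverse_depth, scale, shift, eps=1e-6):
    """Map relative inverse depth to metric depth with one global scale and shift.

    Assumption: metric inverse depth = scale * relative + shift, clamped
    below by eps so that the final inversion never divides by zero.
    """
    return [
        [1.0 / max(scale * value + shift, eps) for value in row]
        for row in relative_inverse_depth
    ]


# Hypothetical 1 x 2 relative inverse-depth map and hand-picked parameters.
relative = [[0.5, 1.0]]
print(to_metric_depth(relative, scale=2.0, shift=0.0))
```

Only two numbers (scale and shift) are needed per image under this assumption, which is why a coarse cue such as a scene description could plausibly be enough to pin them down.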
Figure 9: In experiments, we used a common feature-extractor (F
Here, we include implementation details omitted from the main paper for brevity. Upon acceptance, a deanonymized repository will be released. The last layer's dimension depended upon the exact ... The feature extractors and decoders varied by domain. In particular, we found that if we did not apply this linear transformation (i.e., pass the raw encodings ... For VQ-based methods, use a large enough codebook to have at least one element per class. Other differences simply reflected differences in architecture (e.g., ... For iNat, we trained all models with batch size 256, using the hyperparameters specified in Table 3.
A Appendix
In the following subsections, we provide theoretical derivations. In this subsection, we provide a formal description of the consistency property of score matching. Assumption A.4. (Compactness) The parameter space is compact. Assumption A.5. (Identifiability) There exists a set of parameters ... A.3 are the conditions that ensure ... A.7 lead to the uniform convergence property [ ... In the following Lemma A.9 and Proposition A.10, we examine the sufficient condition for ... We show that the sufficient conditions stated in Lemma A.9 can be satisfied using the ... Figure A1: An illustration of the relationship between the variables discussed in Proposition 4.1, Lemma A.12, and Lemma A.13. The properties of KL divergence and Fisher divergence presented in the last two rows are derived in Lemmas A.12 ... In this section, we provide formal derivations for Proposition 4.1, Lemma A.12, and Lemma A.13. Based on Remark A.14, the following holds: D ... In this section, we elaborate on the experimental setups and provide the detailed configurations for the experiments presented in Section 5 of the main manuscript.